In [ ]:
%pylab inline
import utils
import scipy.stats
from sklearn.preprocessing import PolynomialFeatures

try:
    from linear_model import BayesianLinearModel
except ImportError:
    msg = "The 'BayesianLinearModel' module does not appear to be installed."
    raise Exception(msg)

# Seed the random number generator (for reproducibility).
np.random.seed(42)

Bayesian Linear Regression

Author: Asher Bender
$$ \DeclareMathOperator{\absoluteVal}{abs} \renewcommand{\det} [1]{{\begin{vmatrix}#1\end{vmatrix}}} \newcommand{\brac} [1]{{ \left( #1 \right) }} \newcommand{\sbrac} [1]{{ \left[ #1 \right] }} \newcommand{\cbrac} [1]{{ \left\{ #1 \right\} }} \newcommand{\AutoMiddle}{{\;\vert\;}} \newcommand{\prob} [2][p]{{#1\!\brac{#2}}} \newcommand{\condprob} [2]{p\!\brac{#1\AutoMiddle#2}} \newcommand{\observations}{{\mathcal{D}}} \newcommand{\sinput}{{x}} \newcommand{\vinput}{{\mathbf{x}}} \newcommand{\minput}{{\mathbf{X}}} \newcommand{\soutput}{{y}} \newcommand{\voutput}{{\mathbf{y}}} \newcommand{\moutput}{{\mathbf{Y}}} \newcommand{\starget}{{t}} \newcommand{\vtarget}{{\mathbf{t}}} \newcommand{\mtarget}{{\mathbf{T}}} \newcommand{\weight}{{w}} \newcommand{\weights}{{\mathbf{\weight}}} \newcommand{\params}{{\mathbf{\theta}}} \newcommand{\design}{{\Phi}} \newcommand{\sbasis}{{\phi}} \newcommand{\vbasis}{{\mathbf{\phi}}} \newcommand{\snoise}{{\epsilon}} \newcommand{\vnoise}{{\mathbf{\snoise}}} \newcommand{\sbasisfcn}[1]{\sbasis\!\brac{#1}} \newcommand{\vbasisfcn}[1]{\vbasis\!\brac{#1}} \newcommand{\variance}{{\sigma^2}} \newcommand{\precision}{{\lambda}} \newcommand{\normal}{{\mathcal{N}}} \newcommand{\normalmean}{{\mathbf{\mu}}} \newcommand{\normalvariance}{{\mathbf{V}}} \newcommand{\normalprecision}{{\mathbf{S}}} \newcommand{\normaldist}[2]{\normal\brac{#1, #2}} \newcommand{\normalcond}[3]{\normal\brac{#1\AutoMiddle#2, #3}} \newcommand{\gammashape}{{\alpha}} \newcommand{\gammascale}{{\beta}} \newcommand{\gammarate}{{\beta}} \newcommand{\gammasym}{{\mathcal{G}}} \newcommand{\gammadist}[2]{\gammasym\brac{#1, #2}} \newcommand{\gammacond}[3]{\gammasym\brac{#1\AutoMiddle#2, #3}} \newcommand{\igammadist}[2]{i\gammasym\brac{#1, #2}} \newcommand{\igammacond}[3]{i\gammasym\brac{#1\AutoMiddle#2, #3}} \newcommand{\normaligammadist}[4]{\normal\!i\gammasym\brac{#1, #2, #3, #4}} \newcommand{\normaligammacond}[5]{\normal\!i\gammasym\brac{#1\AutoMiddle#2, #3, #4, #5}} \newcommand{\normalgammadist}[4]{\normal\gammasym\brac{#1, #2, #3, #4}} \newcommand{\normalgammacond}[5]{\normal\gammasym\brac{#1\AutoMiddle#2, #3, #4, #5}} \newcommand{\gammafcn}[1]{\Gamma\!\brac{#1}} \newcommand{\data}{\mathcal{D}} $$

Linear Model

Consider a linear model that produces target outputs, $\soutput$, by linearly combining the model parameters, $\weight_i$, and the input variables, $\sinput_i$:

\begin{equation} \soutput\brac{\vinput, \weights} = \weight_0 + \weight_1\sinput_1 + \ldots + \weight_D\sinput_D. \end{equation}

The linear model can be extended to produce outputs that are a linear combination of fixed nonlinear functions of the input variables. These functions are known as basis functions. The linear model can now be expressed as

\begin{equation} \soutput\brac{\vinput, \weights} = \weight_0\sbasis_0\brac{\sinput} + \weight_1\sbasis_1\brac{\sinput} + \ldots + \weight_M\sbasis_M\brac{\sinput}, \end{equation}

where $\sbasis_0\brac{\sinput} = 1$ so that $\weight_0$ acts as a bias term.

For a set of $N$ observations, the model can be written in matrix form as

\begin{equation} \voutput = \design\weights \end{equation}

where:

  • $\voutput = \brac{\soutput_1, \ldots, \soutput_N}^T$ are the observed scalar outputs.
  • $\minput = \brac{\vinput_1, \ldots, \vinput_N}^T$ are the $D$-dimensional inputs.
  • $\weights = \brac{\weight_0, \ldots, \weight_M}^T$ is the vector of model parameters.
  • $\design$ is the design matrix where the input data, $\minput$, has been passed through the vector of nonlinear basis functions $\vbasis = \brac{\sbasis_0, \ldots, \sbasis_M}$.

The dimensions involved in the linear model are shown below:

\begin{equation} \begin{bmatrix} \soutput_1 \\ \vdots \\ \soutput_N \\ \end{bmatrix} = \begin{bmatrix} \sbasis_0\!\brac{\vinput_1} & \ldots & \sbasis_M\!\brac{\vinput_1} \\ \vdots & \ddots & \vdots \\ \sbasis_0\!\brac{\vinput_N} & \ldots & \sbasis_M\!\brac{\vinput_N} \\ \end{bmatrix} \begin{bmatrix} \weight_0 \\ \vdots \\ \weight_M \\ \end{bmatrix} \end{equation}
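
To make these dimensions concrete, the design matrix for a polynomial basis can be generated with scikit-learn's PolynomialFeatures (imported above). The cell below is only an illustrative sketch; the variables X_demo and Phi_demo are not used elsewhere.


In [ ]:
# Illustrative sketch: design matrix for a degree-two polynomial basis.
# Each row corresponds to one input and each column to one basis function
# (1, x, x^2), so the matrix has shape (number of inputs, number of weights).
X_demo = np.array([[0.0], [0.5], [1.0]])
Phi_demo = PolynomialFeatures(2).fit_transform(X_demo)
print Phi_demo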

Example

The example shown below plots the 'hidden' linear model $f\brac{\vinput} = -0.3 + 0.5\vinput$ over the interval $\sbrac{-1, 1}$. In later cells, inputs are sampled from the uniform distribution $U\brac{\vinput|-1, 1}$ and substituted into this function to generate observations.


In [ ]:
# Create hidden linear model.
w_true = [-0.3, 0.5]
polybasis = lambda x, p: PolynomialFeatures(p).fit_transform(x)
linear_model = lambda x, w=w_true: polybasis(x, len(w) - 1).dot(w).reshape(len(x), 1)

utils.plot(({'x': np.linspace(-1., 1.)[:, None], 'model': linear_model},))

Least squares parameter estimation

Suppose the form of the model is known but the model parameters are unknown. Following from the example above, we assume the model takes the form

\begin{equation} f\brac{\vinput} = b + m\vinput \end{equation}

Although the form of the model is known, the model parameters are treated as unknown. The task is to estimate the slope $m$ and intercept $b$ from the data. In this scenario we also assume the data is perturbed by noise of unknown magnitude.


In [ ]:
# Make noisy observations of model.
N = 1000
noise = 0.2
X = np.random.uniform(-1.0, 1.0, size=(N, 1))
y = linear_model(X) + np.random.normal(scale=noise, size=(N, 1)) 

utils.plot({'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 2, 'color': 'k'})

More generally, the task is to estimate the model weights $\weights$ from the linear model:

\begin{equation} \vtarget = \design\weights + \vnoise \end{equation}

The least-squares solution can be obtained by finding the parameters that minimise the squared error between the model and observed values. These parameters are given by:

\begin{equation} \weights_{ml} = (\design^T\design)^{-1}\design^T\vtarget \end{equation}
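
The normal equations can also be solved directly with numpy. The sketch below uses the data generated above and solves $\brac{\design^T\design}\weights = \design^T\vtarget$ with np.linalg.solve (preferred over an explicit matrix inverse); it should agree with the lstsq result in the next cell.


In [ ]:
# Sketch: solve the normal equations directly for comparison with lstsq.
Phi = polybasis(X, len(w_true) - 1)
w_normal = np.linalg.solve(Phi.T.dot(Phi), Phi.T.dot(y))
print "Normal-equation coefficients: {}".format(w_normal.squeeze())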

In [ ]:
# Note that numpy provides a function for implementing the above equation.   
# 'lstsq' returns the least-squares solution to a linear matrix equation.
w_ml = np.linalg.lstsq(polybasis(X, len(w_true) - 1), y)[0]
model = lambda x: linear_model(x, w=w_ml)

utils.plot(({'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 2, 'color': 'k'},
            {'x': X, 'model': model, 'color': 'r', 'linewidth': 2}))

In [ ]:
print "'True' coefficients:        {}".format(w_true)
print "Least-squares coefficients: {}".format(w_ml.squeeze())

Bayesian Linear model

If the observations are perturbed by independent and identically distributed (i.i.d.) Gaussian noise with a zero mean and a precision of $\precision$, an observed output can be modelled as a Gaussian random variable where

\begin{equation} \starget = \weights^T \vbasisfcn{\vinput} + \snoise, \end{equation}

and

\begin{equation} \condprob{\starget}{\vinput, \weights, \precision} = \normalcond{\starget}{\weights^T \vbasisfcn{\vinput}}{\precision^{-1}}. \end{equation}

Assuming Gaussian noise, the above equation states that the expected output at a nominated input location, $\vinput$, is a function of the basis functions, $\vbasis$, and the model parameters, $\weights$. The variance of the observations is given by the inverse of the Gaussian precision, $\precision^{-1}$. Often these variables are unknown. In such cases, fitting a linear model to the data requires finding the parameters $\params = \brac{\weights, \precision}$ that best model the relationship between the observed input and output data.
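
To make this likelihood concrete, the log-likelihood of the noisy observations generated earlier can be evaluated under the 'true' weights using scipy.stats (imported in the first cell). This is only an illustrative sketch; it assumes the noise standard deviation of 0.2 used to generate the data.


In [ ]:
# Sketch: log-likelihood of the observations under the 'true' model, assuming
# the known noise standard deviation used to generate the data.
mean = linear_model(X)                                           # w^T phi(x)
loglik = scipy.stats.norm.logpdf(y, loc=mean, scale=noise).sum()
print "Log-likelihood under the true model: {:.1f}".format(loglik)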

Parameter estimation

The Bayesian approach to estimating the model parameters, $\params$, is to place a distribution over these variables. For a Gaussian likelihood with unknown mean (the weights) and unknown precision, the conjugate prior is the normal-gamma distribution:

\begin{equation} \prob{\weights, \precision} = \normalgammacond{\weights, \precision} {\normalmean}{\normalprecision} {\gammashape}{\gammarate} \end{equation}
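
One common parameterisation of this prior draws the precision from a gamma distribution and then, conditioned on that precision, draws the weights from a Gaussian whose covariance is scaled by the inverse of the sampled precision. The sketch below samples from such a prior using illustrative hyper-parameter values; these values are assumptions for the example only and are not used by BayesianLinearModel.


In [ ]:
# Sketch: draw one sample from a normal-gamma prior (illustrative values).
mu_0 = np.zeros(2)          # prior mean of the weights
S_0 = np.eye(2)             # prior scale matrix of the weights
alpha_0, beta_0 = 2.0, 1.0  # gamma shape and rate for the precision

# numpy's gamma takes a scale parameter, so the rate beta_0 is inverted.
lmbda = np.random.gamma(alpha_0, 1.0 / beta_0)
# Conditioned on the precision, the weights are Gaussian with covariance S_0 / lambda.
w = np.random.multivariate_normal(mu_0, S_0 / lmbda)
print "Sampled precision: {:.2f}, sampled weights: {}".format(lmbda, w)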

Since the normal-gamma distribution forms a conjugate prior, closed-form expressions can be derived for the posterior updates, where:

\begin{equation} \condprob{\weights, \precision}{\minput, \voutput} = \normalgammacond{\weights, \precision} {\normalmean_N}{\normalprecision_N} {\gammashape_N}{\gammarate_N} \end{equation}

and the sufficient statistics of the update are given by:

\begin{align} \normalmean_N &= \normalprecision_N \brac{\normalprecision^{-1}_0\normalmean_0 + \design^T\voutput} \\ \normalprecision^{-1}_N &= \normalprecision^{-1}_0 + \design^T\design \\ \gammashape_N &= \gammashape_0 + \frac{N}{2} \\ \gammarate_N &= \gammarate_0 + \frac{1}{2} \brac{\normalmean_0^T\normalprecision_0^{-1}\normalmean_0 + \voutput^T\voutput - \normalmean_N^T\normalprecision_N^{-1}\normalmean_N} \end{align}

These equations allow the distribution over the model parameters to be updated sequentially as data is observed.
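
A minimal numpy sketch of these updates is shown below, using a proper normal-gamma prior (the improper prior used in the next section requires a slightly different initialisation). The helper name ng_update and the prior values are illustrative only; they are not part of the BayesianLinearModel library.


In [ ]:
# Sketch: normal-gamma posterior update for one batch of data (Phi, y),
# following the sufficient statistics above. Names are illustrative.
def ng_update(mu_0, S_0, alpha_0, beta_0, Phi, y):
    y = y.ravel()
    S_N_inv = np.linalg.inv(S_0) + Phi.T.dot(Phi)
    S_N = np.linalg.inv(S_N_inv)
    mu_N = S_N.dot(np.linalg.solve(S_0, mu_0) + Phi.T.dot(y))
    alpha_N = alpha_0 + 0.5 * len(y)
    beta_N = beta_0 + 0.5 * (mu_0.dot(np.linalg.solve(S_0, mu_0)) +
                             y.dot(y) - mu_N.dot(S_N_inv.dot(mu_N)))
    return mu_N, S_N, alpha_N, beta_N

# Apply the update sequentially: the posterior from the first batch becomes
# the prior for the second batch.
Phi = polybasis(X, 1)
mu, S, a, b = np.zeros(2), 100. * np.eye(2), 1e-3, 1e-3
mu, S, a, b = ng_update(mu, S, a, b, Phi[:10], y[:10])
mu, S, a, b = ng_update(mu, S, a, b, Phi[10:20], y[10:20])
print "Posterior mean of the weights after 20 observations: {}".format(mu)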

Example

This section replicates the examples shown in Figure 3.7 of [1] and Figure 7.11 of [2]. The example has been extended to the case where the precision is unknown. It demonstrates sequential Bayesian updating using a simple two-parameter model of the form $\soutput\brac{\sinput, \weights} = \weight_0 + \weight_1\sinput$. Since this model only has two parameters, the posterior can be visualised easily.

As in the previous example, the input data is created by sampling from the uniform distribution $U\brac{\vinput|-1, 1}$. The input data is substituted into the function $f\brac{\vinput} = -0.3 + 0.5\vinput$ and Gaussian noise, with a standard deviation of 0.2, is added to the output.

Initialisation

The distribution is initialised with an uninformative and improper prior where:

\begin{align} \normalmean_0 &= \mathbf{0} \\ \normalprecision^{-1}_0 &= \infty^{-1}\mathbf{I} \\ \gammashape_0 &= -\frac{D}{2} \\ \gammarate_0 &= 0 \\ \end{align}

Since the prior is improper, the resulting posterior is only well defined when $D < N - 1$. In this two-parameter example, the first update must therefore use at least four observations.


In [ ]:
# Create Bayesian linear model.
basis = lambda x: polybasis(x, 1)
blm = BayesianLinearModel(basis=basis)

# Perform update.
blm.update(X[:4], y[:4])
utils.plot_update(X[:4], y[:4], linear_model, blm)

10 observations


In [ ]:
blm.update(X[4:10], y[4:10])
utils.plot_update(X[:10], y[:10], linear_model, blm)

100 observations


In [ ]:
blm.update(X[10:100], y[10:100])
utils.plot_update(X[:100], y[:100], linear_model, blm)

1000 observations


In [ ]:
blm.update(X[100:], y[100:])
utils.plot_update(X, y, linear_model, blm)

Note that as more observations are added, the posterior distribution collapses onto the correct estimates of the model parameters. Since an uninformative prior was used, the maximum a posteriori estimates of the model weights are identical to the maximum likelihood solution. This can be confirmed by comparing the Bayesian solution to the maximum likelihood solution:


In [ ]:
print "'True' coefficients:        {}".format(w_true)
print "Least-squares coefficients: {}".format(w_ml.squeeze())
print "Bayesian coefficients:      {}".format(blm.location.squeeze())

Model Selection

It is important to note that the term 'linear model' describes linearity in the model parameters. That is, the output is a linear combination of the model weights and the inputs (or design matrix produced by basis function expansion). For example, complex non-linear outputs can be constructed using high degree polynomial basis functions:


In [ ]:
N = 50
noise = 0.25
X = np.sort(np.random.uniform(0, 2*np.pi, N)).reshape((N, 1))
y = np.sin(X) + np.random.normal(scale=noise, size=(N, 1)) 

# Approximate the sine function with a degree-6 polynomial.
w_ml = np.linalg.lstsq(polybasis(X, 6), y)[0]
model = lambda x: linear_model(x, w=w_ml)

utils.plot(({'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 2, 'color': 'k'},
            {'x': X, 'model': model, 'color': 'r', 'linewidth': 2}))

If the model is known, the parameters can be estimated using least-squares or Bayesian updates. The previous section showed that, under a (semi-conjugate) uninformative prior, these two methods produce the same estimates. As the complexity of the data increases, however, specifying a model becomes more difficult.

Bayesian methods have a distinct advantage over maximum-likelihood methods during model selection. The problem with using maximum likelihood for model selection is that it favours complex models: as the complexity of the model increases, the extra degrees of freedom allow the model to fit the training data more precisely. Although such models reflect the training data accurately, they usually generalise poorly because they also fit the noise in the data; these models are examples of 'over-fitting'.

Bayesian methods differ from maximum likelihood model selection in that they display a preference for simpler models. Bayesian model selection is performed by calculating the marginal likelihood, also known as the model evidence. For a specified model, this is done by marginalising out the model parameters and calculating the probability of the data:

\begin{equation} \condprob{\data}{m} = \int \condprob{\data}{\params, m}\condprob{\params}{m}\,d\params \end{equation}

For the Bayesian linear model, the marginal likelihood is given by:

\begin{equation} \prob{\data} = \frac{1}{\brac{2\pi}^\frac{N}{2}} \frac{{\gammascale_0}^{\gammashape_0}} {{\gammascale_N}^{\gammashape_N}} \frac{\gammafcn{\gammashape_N}} {\gammafcn{\gammashape_0}} \frac{\det{\normalprecision_N}^\frac{1}{2}} {\det{\normalprecision_0}^\frac{1}{2}} \end{equation}
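
In practice the logarithm of this expression is evaluated for numerical stability. The sketch below shows the log form as a standalone function; the function name is illustrative and this is not necessarily how BayesianLinearModel computes its evidence.


In [ ]:
# Sketch: log marginal likelihood from the prior and posterior normal-gamma
# statistics, using gammaln and slogdet for numerical stability.
# (Assumes a proper prior; the improper prior above requires the limiting form.)
from scipy.special import gammaln

def log_evidence(N, alpha_0, beta_0, S_0, alpha_N, beta_N, S_N):
    _, logdet_0 = np.linalg.slogdet(S_0)
    _, logdet_N = np.linalg.slogdet(S_N)
    return (-0.5 * N * np.log(2 * np.pi)
            + alpha_0 * np.log(beta_0) - alpha_N * np.log(beta_N)
            + gammaln(alpha_N) - gammaln(alpha_0)
            + 0.5 * (logdet_N - logdet_0))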

Example

In this example, polynomials of increasing complexity are fit to the trigonometric data shown in the previous figure. Trigonometric functions are complex non-linear functions which cannot be exactly represented using a linear model. Despite this, a polynomial of sufficient complexity can reasonably approximate the observed data. This leaves the open question: how many degrees are sufficient? Model selection can be used to answer this question.


In [ ]:
num_plots = 15
num_cols = 3
num_rows = int(np.ceil(float(num_plots) / num_cols))

# Iterate through polynomial degrees and plot linear models.
MSE = np.zeros(num_plots)
ML = np.zeros(num_plots)
fig = plt.figure(figsize=(5 * num_cols, 5 * num_rows))
fig.subplots_adjust(hspace=0.6)
for p in range(0, num_plots):

    # Determine the maximum likelihood weights and evaluate the model.
    theta = polybasis(X, p)    
    w_ml = np.linalg.lstsq(theta, y)[0]
    fQuery = theta.dot(w_ml)
    MSE[p] = np.mean((fQuery - y)**2)
    
    # Create Bayesian linear model, binding the current degree 'p' as a
    # default argument so the basis is fixed for this iteration.
    blm = BayesianLinearModel(basis=lambda x, p=p: polybasis(x, p))
    blm.update(X, y)
    ML[p] = blm.evidence()
    
    # Plot ML/Bayesian linear models.
    plt.subplot(num_rows, num_cols, p + 1)
    utils.plot(({'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 2, 'color': 'k'},
                {'x': X, 'y': fQuery, 'color': 'b', 'linestyle': '--', 'linewidth': 2},
                {'x': X, 'y': blm.predict(X), 'color': 'r', 'linewidth': 2}))
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.title('Polynomial degree: ' + str(p))

Again, note that if an uninformative prior is used, the Bayesian solution is identical to the maximum likelihood solution (least-squares). The plots above show that as the model complexity increases, the model provides a closer fit to the data.

The mean squared error decreases as the model complexity increases, as shown in the figure below. This behaviour heavily biases the maximum likelihood method towards over-fitting. Note that the rate at which the mean squared error decreases has plateaued by a polynomial degree of three. Conversely, the log marginal likelihood, returned by the Bayesian solution, displays a distinct peak at a degree of three. Beyond this, model complexity is penalised more heavily than data fit, leading to a diminishing log marginal likelihood.
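
The preferred complexity can also be read off numerically. On a typical run the mean squared error keeps decreasing with degree, whereas the log marginal likelihood peaks at a low degree; the exact values depend on the random noise in the data.


In [ ]:
# Report which polynomial degree each criterion prefers on this run.
print "Degree minimising the MSE:      {}".format(np.argmin(MSE))
print "Degree maximising the evidence: {}".format(np.argmax(ML))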


In [ ]:
utils.plot_model_fit(MSE, ML)

Empirical Bayes

When the model parameters cannot be analytically marginalised out, closed-form parameter estimation cannot be used. Instead, the posterior on the hyper-parameters can be approximated with a point estimate. This is done by maximising the marginal likelihood of the data with respect to the hyper-parameters. This approach is called empirical Bayes, type-II maximum likelihood or the evidence procedure. Compared to the fully Bayesian solution, empirical Bayes violates the principle that the prior should be nominated before any data are observed. A level of Bayesian inference is sacrificed for tractability.
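
The general recipe is to treat the negative log marginal likelihood as an objective function and hand it to a numerical optimiser. The sketch below shows this pattern with scipy.optimize; neg_log_evidence is a hypothetical stand-in for a routine that scores a hyper-parameter vector against the data and is not part of the BayesianLinearModel library.


In [ ]:
# Sketch of the empirical Bayes recipe: minimise the negative log marginal
# likelihood over the hyper-parameters with a local optimiser.
from scipy.optimize import minimize

def fit_hyperparameters(neg_log_evidence, theta_0, X, y):
    # 'neg_log_evidence(theta, X, y)' is a hypothetical objective function.
    result = minimize(lambda theta: neg_log_evidence(theta, X, y), theta_0,
                      method='Nelder-Mead')
    return result.x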

Example

Consider the following sigmoid function:

\begin{equation} f\!\left(x\right) = \frac{A}{1 + e^{B\left(-x + C\right)}} - D \end{equation}

where:

  • $A$ controls the output scale
  • $B$ controls the rate/slope of the output
  • $C$ controls the $x$-offset of the output
  • $D$ controls the $y$-offset of the output

The output of the sigmoid function is clearly a non-linear function of its parameters, so closed-form parameter estimation cannot be used to find values for the sigmoid parameters. Instead, empirical Bayes can be used.


In [ ]:
def sigmoid(x, params=(1., 1., 0., 0.)): 
    scale, rate, xoffset, yoffset = params
    return np.abs(scale) / (1. + np.exp(np.abs(rate)*(-x + xoffset))) - yoffset

N = 50
noise = 0.1
X = np.sort(np.random.uniform(-6, 6, N)).reshape((N, 1))
y = sigmoid(X, (2, 2, 1, 0.5)) + np.random.normal(scale=noise, size=(N, 1)) 

utils.plot(({'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 5, 'color': 'k'},))

Maximising the marginal likelihood is seldom a convex problem. However, it is common to use local optimisers to locate locally optimal hyper-parameters. Although these parameters may not be globally optimal, this practice is a computationally cheap way of finding useful parameters. If it is clear that local optima are affecting the solution, multi-start optimisation or more sophisticated methods of setting the hyper-parameters will be required.

In the following block, the method empirical_bayes uses a numerical optimiser to maximise the marginal likelihood with respect to the sigmoid parameters.


In [ ]:
# Create Bayesian linear model.
blm = BayesianLinearModel(basis=sigmoid)

# Fit parameters using empirical Bayes.
x0 = np.array([1., 1., 0., 0.])
blm.empirical_bayes(x0, X, y)

M = 100
xs = np.linspace(-6, 6, M).reshape((M, 1))
mu, S = blm.predict(xs, variance=True)
utils.plot(({'x': xs, 'y': mu, 'color': 'r', 'linewidth': 2},
            {'x': xs, 'y': mu - S, 'color': 'r', 'linestyle': '--'},
            {'x': xs, 'y': mu + S, 'color': 'r', 'linestyle': '--'},
            {'x': X, 'y': y, 'linestyle': '', 'marker': '.', 'markersize': 5, 'color': 'k'},))
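
If local optima are a concern, a simple multi-start strategy is to repeat the empirical Bayes fit from several random initialisations and keep the best model. The sketch below assumes that blm.evidence() also reports the (approximate) log marginal likelihood after an empirical_bayes fit; this is an assumption about the library rather than documented behaviour.


In [ ]:
# Sketch: multi-start empirical Bayes (assumes .evidence() is valid after an
# empirical_bayes fit -- an assumption, not documented behaviour).
best_model, best_evidence = None, -np.inf
for i in range(5):
    candidate = BayesianLinearModel(basis=sigmoid)
    candidate.empirical_bayes(x0 + 0.5 * np.random.randn(4), X, y)
    if candidate.evidence() > best_evidence:
        best_model, best_evidence = candidate, candidate.evidence()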